AITopics | safe policy

Collaborating Authors

safe policy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Safe Policy Improvement by Minimizing Robust Baseline Regret

Mohammad Ghavamzadeh, Marek Petrik, Yinlam Chow

Neural Information Processing SystemsMay-1-2026, 05:55:55 GMT

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, which is guaranteed to outperform a given baseline strategy. In this paper, we develop and analyze a new model-based approach that computes a safe policy, given an inaccurate model of the system's dynamics and guarantees on the accuracy of this model. The new robust method uses this model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy, otherwise. We show that our formulation is NP-hard and propose a simple approximate algorithm. Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

Neural Information Processing Systems

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

Add feedback

Safe Policy Improvement by Minimizing Robust Baseline Regret

Neural Information Processing SystemsMar-17-2026, 10:02:04 GMT

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to the existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and seamlessly fall back to the baseline policy, otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.79)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.61)

Add feedback

OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories

Burnwal, Returaj, Bhatt, Nirav Pravinbhai, Ravindran, Balaraman

arXiv.org Machine LearningFeb-12-2026

This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe and reward-maximizing policies from demonstrations that do not have per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs can be difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on reward maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe and reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach can learn safer policies that satisfy cost constraints without degrading the reward performance, thus outperforming several baselines.

machine learning, reinforcement learning, trajectory, (17 more...)

arXiv.org Machine Learning

2602.11018

Country:

Asia > India > Tamil Nadu > Chennai (0.04)
Europe > Middle East > Cyprus > Pafos > Paphos (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

8ec2ba5e96ec1c050bc631abda80f269-Paper.pdf

Neural Information Processing SystemsFeb-9-2026, 21:01:58 GMT

artificial intelligence, machine learning, probability, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Texas > Brazos County > College Station (0.05)
Asia > Middle East > Jordan (0.04)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Renewable (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations

Neural Information Processing SystemsDec-27-2025, 03:25:34 GMT

We consider offline safe imitation learning (IL), where the agent aims to learn the safe policy that mimics preferred behavior while avoiding non-preferred behavior from non-preferred demonstrations and unlabeled demonstrations. This problem setting corresponds to various real-world scenarios, where satisfying safety constraints is more important than maximizing the expected return. However, it is very challenging to learn the policy to avoid constraint-violating (i.e.

non-preferred demonstration, offline safe imitation learning, safedice, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.47)

Add feedback

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Neural Information Processing SystemsDec-24-2025, 11:18:48 GMT

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.

bounded constraint violation, learning policy, name change, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.77)

Add feedback

Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction

Tabbara, Ihab, Yang, Yuxuan, Sibai, Hussein

arXiv.org Artificial IntelligenceNov-21-2025

Safety assurance is a fundamental requirement for deploying learning-enabled autonomous systems. Hamilton-Jacobi (HJ) reachability analysis is a fundamental method for formally verifying safety and generating safe controllers. However, computing the HJ value function that characterizes the backward reachable set (BRS) of a set of user-defined failure states is computationally expensive, especially for high-dimensional systems, motivating the use of reinforcement learning approaches to approximate the value function. Unfortunately, a learned value function and its corresponding safe policy are not guaranteed to be correct. The learned value function evaluated at a given state may not be equal to the actual safety return achieved by following the learned safe policy. To address this challenge, we introduce a conformal prediction-based (CP) framework that bounds such uncertainty. We leverage CP to provide probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states. Specifically, we use CP to calibrate the switching between the unsafe nominal controller and the learned HJ-based safe policy and to derive safety guarantees under this switched policy. We also investigate using an ensemble of independently trained HJ value functions as a safety filter and compare this ensemble approach to using individual value functions alone.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2511.07899

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories

Burnwal, Returaj, Bhatt, Nirav Pravinbhai, Ravindran, Balaraman

arXiv.org Machine LearningNov-17-2025

In this work, we study the problem of offline safe imitation learning (IL). In many real-world settings, online interactions can be risky, and accurately specifying the reward and the safety cost information at each timestep can be difficult. However, it is often feasible to collect trajectories reflecting undesirable or risky behavior, implicitly conveying the behavior the agent should avoid. We refer to these trajectories as non-preferred trajectories. Unlike standard IL, which aims to mimic demonstrations, our agent must also learn to avoid risky behavior using non-preferred trajectories. In this paper, we propose a novel approach, SafeMIL, to learn a parame-terized cost that predicts if the state-action pair is risky via Multiple Instance Learning. The learned cost is then used to avoid non-preferred behaviors, resulting in a policy that prioritizes safety. We empirically demonstrate that our approach can learn a safer policy that satisfies cost constraints without degrading the reward performance, thereby outperforming several baselines.

machine learning, reinforcement learning, trajectory, (18 more...)

arXiv.org Machine Learning

2511.08136

Country:

North America > United States (0.14)
Asia > India (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
Information Technology > Artificial Intelligence > Robots (0.90)

Add feedback

Probabilistic Safety Guarantee for Stochastic Control Systems Using Average Reward MDPs

Omidi, Saber, Petrik, Marek, Yoon, Se Young, Begum, Momotaz

arXiv.org Artificial IntelligenceNov-12-2025

Safety in stochastic control systems, which are subject to random noise with a known probability distribution, aims to compute policies that satisfy predefined operational constraints with high confidence throughout the uncertain evolution of the state variables. The unpredictable evolution of state variables poses a significant challenge for meeting predefined constraints using various control methods. To address this, we present a new algorithm that computes safe policies to determine the safety level across a finite state set. This algorithm reduces the safety objective to the standard average reward Markov Decision Process (MDP) objective. This reduction enables us to use standard techniques, such as linear programs, to compute and analyze safe policies. We validate the proposed method numerically on the Double Integrator and the Inverted Pendulum systems. Results indicate that the average-reward MDPs solution is more comprehensive, converges faster, and offers higher quality compared to the minimum discounted-reward solution. Keywords: Safety Critical Systems, Robotics, Average Reward MDPs, Stochastic Control.

artificial intelligence, machine learning, probability, (14 more...)

arXiv.org Artificial Intelligence

2511.08419

Country: North America > United States (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.49)

Add feedback

SAC-Loco: Safe and Adjustable Compliant Quadrupedal Locomotion

Zhang, Aoqian, Zhuang, Zixuan, Wang, Chunzheng, Ge, Shuzhi Sam, Shi, Fan, Xiang, Cheng

arXiv.org Artificial IntelligenceSep-30-2025

Quadruped robots are designed to achieve agile locomotion by mimicking legged animals. However, existing control methods for quadrupeds often lack one of the key capabilities observed in animals: adaptive and adjustable compliance in response to external disturbances. Most locomotion controllers do not provide tunable compliance and tend to fail under large perturbations. In this work, we propose a switched policy framework for compliant and safe quadruped locomotion. First, we train a force compliant policy with adjustable compliance levels using a teacher student reinforcement learning framework, eliminating the need for explicit force sensing. Next, we develop a safe policy based on the capture point concept to stabilize the robot when the compliant policy fails. Finally, we introduce a recoverability network that predicts the likelihood of failure and switches between the compliant and safe policies. Together, this framework enables quadruped robots to achieve both force compliance and robust safety when subjected to severe external disturbances.

artificial intelligence, external force, robot, (17 more...)

arXiv.org Artificial Intelligence

2509.23223

Country: Asia (0.14)

Genre: Research Report (0.40)

Industry: Energy (0.47)

Technology: Information Technology > Artificial Intelligence > Robots > Locomotion (0.71)

Add feedback